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Abstract: We introduce a random partition model for Bayesian nonparametric regression. 

> 

CN ■ The model is based on infinitely-many disjoint regions of the range of a latent covariate- 

i> . 

■ dependent Gaussian process. Given a realization of the process, the cluster of dependent 

(N 

variable responses that share a common region are assumed to arise from the same distribu- 
tion. Also, the latent Gaussian process prior allows for the random partitions (i.e., clusters 
of the observations) to exhibit dependencies among one another. The model is illustrated 
through the analysis of a real data set arising from education, and through the analysis of 
simulated data that were generated from complex data-generating models. 
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1 Introduction 



Regression modeling is ubiquitous in many applied research areas. In regression studies, the 
objective is to estimate specific distributional aspects of a dependent variable Y, conditional 
on covariates x = (xi, . . . , x p ) J of interest, from a sample data set T> n = {(yi, Xj)}™ =1 , which 
for notational convenience will be written as X n = (xj)™ =1 and y = (y±, . . . ,y n ) J - 

Indeed, for Bayesian nonparametric regression, much research has focused on developing 
random partition models (RPMs) that follow the general form: 

/(yix n , P j = nn^i**'**) 

d=l i£S d 

6 d ~ G 

Pn ~ 7T(Pnl X n)- 

In the above, p n = {Sd} d 2i denotes a partition of the indices {1, . . . , n} of the sample data 
{(yi, Xj)}" =1 into K n distinct clusters, and 7r(pjX n ) denotes an RPM. These RPM's provide a 
very broad class of models that encompasses product partition models (PPMs), species sam- 
pling models (SSMs), and model-based clustering (MBC); see Quintana (2006) for a review. 
A PPM is of the form 7r(pjX n ) = cq Yld=i c (Sd\X n ), with cohesion functions c(Sd|X n ) > 
(Hartigan, 1990; Barry & Hartigan, 1993), PPMs have been developed for Bayesian non- 
parametric regression (Muller & Quintana, 2010; Park & Dunson, 2010; Miiller et al., 2011). 
A SSM assumes the form 7r(p n |X n ) = vr Xn (|5'i|, |SkJ) (Pitman 1996; Ishwaran & James, 
2003). The Dirichlet process (Ferguson, 1973), a popular Bayesian nonparametric model, can 
be characterized either as a special PPM or as a special SSM. On the other hand, with MBC, 
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a random partition p n = {S d = {i : d; L = d}}^ is formed by sampling latent indicators d i} 
(i = 1, . . . , n), from weights Uj of a discrete mixture model. 

In this paper, we develop and illustrate a novel Bayesian nonparametric regression model, 
which may be characterized as an RPM. The model randomly partitions the n observations 
into distinct clusters that each share a common region of a transformed covariate space, and 
then given the covariate x, uses the dependent responses in the covariate region to predict 
Y . Specifically, the novel regression model is based on a fixed partition (A,-) of the range 
R = WjL_ OQ (Aj = (J — and a Gaussian Process (GP) z(x) which induces a random 
partition by p n = {S d = {i : zfe) E A d }}^ v 

To further elaborate, consider the standard Bayesian nonparametric mixture model, with 
latent variable for the component. Such a model is given by 



The d classifies which component f{y\9) the observation y comes from and the weight Wd is 
the population probability of coming from component d. There has been much debate and 
proposals as to how covariates x enter into such a model in a meaningful way. Following the 
RPM idea it makes most sense that if x and x' are close then observations y and y' would be 
expected to come from the same component. Hence, it is appropriate to make the d depend 
on x. A convenient way to achieve this is via a Gaussian process z(x), such that 



f{y,d) = w d f(y\9 d ). 
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where (A,) is a fixed partition of M, i.e. Uf A,- = IR and A,- D Ay = for j ^ j'. 

The usual idea of having the weights depend on x in the form ojj(x) and having ojj(x) 
close to Wj(x') whenever x is close to x', is a rather weak condition. While in this case the 
densities for y and y' may be close to each other, there is no suggestion that y and y' are 
coming from the same component, which is the more realistic notion. So what is needed is 
to have y close to y' in probability, rather than simply close in distribution. 

Therefore, the proposed model is given by 

f(y,d\z,x) = l(z(x)eA d )f(y\9 d ). 

So 

f{y\z^) = Y j m*)^A ] )f{y\d ] ) 

j 

and 

/(y|x) = X)a; i (x)/(y|e i ) 

3 

where 

Wj(x) = PO(x) g AA 

In Karabatsos and Walker (2012), this model was employed where z(x) ~ n(r/(x), <r 2 (x)). 
It was explained in that paper how er(x) controlled the modes of /(y|x) and why this was 
an important aspect of the model in keeping with the idea that x close to x' determines y 
and y' coming from the same component. That is, for x close to x', it can be that Wj(x) 
and Wj(x') are both close to 1 for some j. Henceforth, we refer to Karabatsos and Walker's 
(2012) model as the "independence model", because it assumes independent latent variables 
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z(x) ~ n(?7(x), cr 2 (x)) and z(x') ~ n(r/(x'), <7 2 (x')) for any two distinct covariates x and x'. 

In the present paper we acknowledge that it would be further desirable for the z process 
to be constructed with dependence; i.e. 

z(.)~GP[r,(.),a(;-)}. 

This will reinforce the notion that it is required for O0j(x) and cjj(x') to both be close to 1 
when x is close to x'. The dependent Gaussian process facilitates this to a greater extent 
than under the independent process. 
In terms of a RPM, we have 

P(di,...,d n |xi,...,x n ) =P(z(x 1 ) e A dl ,...,z(x n ) e A dn ) . 

This is an appealing version of a probability for the partition as it marginalizes from higher 
to lower dimensions, addressing the curse of dimensionality. Also, it is clear that since 
our model allows for the GP to exhibit dependencies among the latent variables z(xi) G 
A dl , . . . ,z(x n ) G Ad n i it is in a sense more flexible than a PPM because it does not force 
partitions under the assumption that 7r(p n |X n ) is a product prior. 

— Insert Figure 1 — 

Figure 1 illustrates the mixture weights Wj(x) and the resulting predictive densities f{y\x) 
of the model, for a single covariate x = x having observed values x± — 1, x-i — 1.3, and 
X2 = 4. Also, the figure assumes rj{x\) = —.30, 77(^2) = .21, 77(0:3) = 4.8, and the squared- 
exponential covariance function a(x, x') = a 2 c exp(— .5| \x — x'|| 2 ), and presents the weights 
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and the densities for small a 2 c = .01 and for large o 2 c = 10. Throughout, || • || denotes the 



the resulting predictive densities f(y\x) are similar when x and x' are close. The weights 
and densities become more dissimilar as the distance between x and x' increases. Also, the 
parameter a 2 c controls the number of modes in /(y|x). At one extreme, as a 2 c decreases, 
/(y|x) becomes more unimodal. As o 2 c increases, /(|/|x) becomes more multimodal. 

We now describe the layout of the rest of the paper. In Section [2] we fully present 
our regression model. In Section [31 we illustrate our model through the analysis of real 
and simulated data sets. In so doing, we compare the predictive performance of our new 
model, against the previous version of our regression model which assumes independent latent 
variables z(xi), . . . , 2(x n ), and against another regression model that is known to provide 
good predictive performance. Section H] concludes with a discussion. 

2 The Regression Model 

For a sample set of data V n = {(xj, yi)}f =1 , our Bayesian nonparametric regression model has 
parameters £ = /3, a 2 ,, <f>), along with latent indicator parameters d = (g^, . . . , o? n ) T . 

The model is defined by: 



Euclidean norm. As shown, when either o 2 c 



is small or large, the mixture weights Uj(x) and 





oo 




(lb) 





(lc) 





(Id) 
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where X ln = (1, xj) nx ( p+1 ), while n K (-|-,-) and ga(-|-,-) are respectively the probability 
density functions of the i^-variate normal and gamma distributions (shape and rate param- 
eterized). As shown, the model is based on a GP, with mean function Xi n /3 and covariance 
function matrix cr^C^Xj, Xj)) nxn , where (3 = (/3 , f} x , . . . , /3 p ) T , and where £</,(•, ■) is a corre- 
lation function that depends on the parameter tp. 

A standard choice of kernel densities is provided by univariate normal densities f(-\Oj) = 
n(-|/^, aj) (j = 0, ±1, ±2, . . .), which may be assigned conjugate prior density: 

oo 

7T(0) = 7T(/X,<T 2 ) = Yl ^j\^j^lj)S^j 2 \^aj,^aj)- 
j=-oo 

For the covariance function o 2 c C^[-^ •), possible choices of the correlation function include the 
powered-exponential family C^(x, x') = exp(— 0-Jlx — x'H^ 2 ) (for <p l > 0; < <ft 2 < 2), the 
Cauchy family, the Matern family, as well as families of correlation functions that are either 
non-stationary or non-isotropic (e.g., Rasmussen & Williams, 2006). 

Given data V n likelihood Hi=i f (Vil^i'X) , with f(yi\xi; C) = /(2/l#jVi( x ;A &h 4>) 
(see Section 1), and a proper prior density 7r(C) defined over = {£}, the posterior density 
of C is proper and given by: 

n 

7T(CPn)0Cn/to;CMC) 

i=l 

up to a proportionality constant. Then the posterior predictive density of Y is defined by: 

/„(y|x)= f /(y|x; CMC|2>n)dC, 
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with this density corresponding to posterior predictive mean and variance 



E n (Yi|x 4 ) = Jyf n (y\xi)dy; Var n (F i |x i ) = J{y - E(Y;|x;)} 2 f n (y\xi)dy. 



In the present paper, in applications of our regression model, our emphasis is in prediction 
rather than inference of the model parameters Hence, we focus statistical inferences on 
the posterior predictive density f n (y\x), and functionals of interest. 

The posterior densities 7r(£|X> n ) and f n (y\x) can be estimated by using standard Gibbs 
MCMC sampling methods for infinite-dimensional models, which make use of strategic la- 
tent variables (Kalli, Griffin, & Walker, 2010). The Appendix provides more details. Also 
the Appendix describes how the model and corresponding MCMC methods can be easily 
extended to handle the analysis of censored observations, discrete dependent variables, and 
the analysis of spatial or spatio-temporal data via an appropriate modification of the GP 
covariance function. 

2.1 Model Assessment and Comparison Methods 

After M regression models are fit to a data set V n , the predictive performance of each model 
m G {!,..., M} can be assessed by the mean-square predictive-error criterion 



(Gelfand & Ghosh, 1998). The criterion is often used in the practice in the assessment 
and comparison of Bayesian models (e.g., Gelfand & Banerjee, 2010). The first term of (jSJ) 




(2) 
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measures data goodness-of-fit, and the second term is a penalty that is large for models 
which either over-fit or under-fit the data. The criterion ([2]) can be re-written as 

n „ n 

D(m) = (y- yi) 2 f n (y\xi,m)dy = J^E n [(^ - y) 2 |x;,m]. 

i=l J i=l 

So the estimate of D(m) is obtained by generating posterior predictive samples yj"** 1 |xj 
(i — 1, . . . , n) at each iteration s = 1, . . . , 5* of the MCMC chain, and then taking 

-. S n g n 

5 (-) = ^EE(^-^ prcd(s) ) =E 5 *H- 

s=l i=l i=l 

where the individual quantities Di(m) can be used to provide a more detailed assessment 
about a model's predictive performance. The Appendix provides some more details about 
the MCMC methods for estimating D(m). 

3 Illustrations 

3.1 Math Teaching Data 

Here we illustrate the proposed model of equation (Tj[|), through the analysis of data that 
were collected to study a new undergraduate teacher education curriculum, instituted in 
2009 by four Chicago-area universities. The study aimed to evaluate the impact of the new 
curriculum on the ability to teach math among n = 89 of its second- year students. Impact is 
measured by a dependent variable called "change" (mean=.80; s.d.=.6), which is the change 
in math teaching ability score of the student, from before (pre-test) and after (post-test) 
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completing a course in math teaching. Also, there are three covariates. The first covariate 
is lmtl40, where lmtl40=l if the course is learning of math teaching (lmt) level 140, and 
lmt 140=0 if the course is lmtl41 (mean=.73; s.d.=.6). The second covariate is uic, which 
is a 0-1 indicator of whether the student is from the University of Illinois-Chicago, versus 
one of the other three universities (mean=.60; s.d.=.5). The third covariate is pretest score 
(mean=— .83; s.d.=.8). Each of the three covariates were z-standardized to have mean zero 
and variance 1, prior to data analysis. 

For the regression model presented in equation (JTJ) , we assumed a squared-exponential 
covariance function for the GP, given by er^C^x, x') = o 2 c exp(— .5| |x — x'|| 2 ). Also for 
this model we assigned mostly high- variance priors fij ~ iid n(/i At = 0,cx 2 = 10), aj 2 ~ iid 
ga(l, 10~ 3 ), /3\o~c ~ n(0, a 2 10 5 I p+ i), and a^ 2 ~ ga(l, 10 4 ), to reflect the relative lack of prior 
information about these parameters. The gamma prior for a 2 c reflects our prior belief that 
the conditional density of the change score, f(y\x.), tends to be unimodal. This implies 
the belief that the covariates x tend to be informative about the dependent variable. To 
estimate the model, a total of 150,000 sampling iterations of the MCMC algorithm were run 
(see Appendix), and the last 75,000 samples were used to estimate the model's posterior 
distribution (after burn- in). The posterior predictive samples and the D(m) criterion stabi- 
lized over the MCMC iterations, and the resulting posterior predictive and D(m) estimates 
had near- zero 95% Monte Carlo confidence intervals (MCCI) according to a consistent batch 
means estimator (Jones et al., 2006). 

— Insert Figures 2 and 3 — 

Figure 2 presents the posterior predictive mean and variance estimates of the change 
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dependent variable, conditional on the three covariates (lmtl40, uic, pretest). The relation 
between change score and pretest score is quite nonlinear. Figure 3 presents the posterior 
predictive density estimates of the change score, for a range of values of the pretest scores, 
while conditioning on uic=l and lmtl40=l. These densities are shown to be skewed and 
unimodal. In summary, the result show that the new teacher education curriculum tended 
to have a positive effect on mathematics teaching ability, over time. 

We also analyzed the data using a simpler version of our regression model flTJ, namely the 
"independence model" (see Section[T|), for which we specified z(xj) ~ ind n((l, xJ)/3, o"^(xj)), 
with o"c( x ) = exp ( ( 1 , xj ) A) , along with priors fij ~ iid n(0, 10), aj 2 ~ iid ga(l,10~ 3 ), and 
(/3, A) ~ n(0, 10 5 l2( p +i)). Thus, this independence model assumed the same priors for 
(/x, cr 2 ), as in the GP-based model described earlier. A previous study (Karabatsos & 
Walker, 2012) showed that the independence model tended to have better predictive perfor- 
mance than 26 other regression models (according to the D(m) criterion), over many real 
and simulated data sets, with the BART model (Bayesian Additive Regression Trees model; 
Chipman, et al. 2010) being among the more competitive models. For the data set under 
current consideration, the independence model was estimated by 150,000 MCMC sampling 
iterations, after discarding the first 75,000 samples (burn-in) (see Appendix A of Karabatsos 
& Walker, 2012). Also, the BART model was fit to the data set, via the generation of 42,000 
posterior samples via a Bayesian back-fitting algorithm. For each of these two models, the 
samples of the D{m) criterion stabilized over sampling iterations, and the resulting D{m) 
estimate had a small 95% MCCI. 

For the current data set of the n = 89 students under the new teacher education curricu- 
lum, our GP-based regression model ([T]) had a better predictive performance (D(m) = 1.3; 
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MCCI=±.l)) than the independence model (D{m) = 5.1; MCCI=±.8)), and the BART 
model (D(m) = 52.5; MCCI=±.04)) which was fit by implementing the BayesTree pack- 
age of the R statistical software (Chipman & McCulloch, 2010). Also, for the GP-based 
model, had no outliers, as the -Dj(m) estimates over the n = 89 observations had a 5- number 
summary (i.e., min, 25%ile, 50%ile, 75%ile, and max) of {.00, .01, .01, .02, .11}. For the in- 
dependence model, it was {.01, .02, .03, .06, .70}. In summary, it seems that the predictive 
accuracy of the regression model can be substantially improved by accounting for dependence 
among the latent variables (z(xi), . . . ,z(x n )). In the next subsection, we use a simulation 
study to further investigate this issue. 

3.2 Complex Regression Functions 

Here, using a range of complex data-generating models, we conduct a simulation study to 
compare the predictive performance between the GP-based regression model, the indepen- 
dence model, and the BART model. They include data-generating models where /(y|x) is 
a unimodal sampling density, with mean depending on complex functions of x. They also 
include data-generating models where /(y|x) is a multimodal sampling density, having mean 
and number of modes that also depend on complex functions of x. 

For the unimodal /(?/|x) setting, we consider two data-generating models which respec- 
tively assumed the following mean functions for the dependent variable: 



1.9[1.35 + exp(xi) sin(13(x 



.6) 2 ) exp(-x 2 ) sin(7x 2 )], 



(3) 



E 4 (r|x) 



10sin(7rxix 2 ) + 20(x 3 - .5) 2 + 10x 4 + 5x 5 + YlkLe ® x k- 



(4) 
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Equation (J3J) is a complex 2-dimensional covariate interaction (Hwang, et al., 1994). Equa- 
tion (J4J) is a complex function of ten covariates, with 5 covariates irrelevant (Friedman, 1991). 
With respect to these two functions, we generated a data set of n — 225 observations from 
n(?/j|Ei(y |xj), .0625)u2(x.j|0, 1), and we generated another data set of n — 100 observations 
from n(yj|E 4 (F|xj), cr^)uio(xj|0, 1), for i = l,...,n, where UR-(Xj|0, 1) denotes the density 
function of a .fT-variate uniform distribution. 

We simulated two additional data sets under settings where /(y|x) is multimodal and 
based on mixtures of normal densities, 10 covariates x, and with the number of modes 
depending on x. Specifically, the number of modes iV mod (x) in the density /(y|x) ranged 
from 1 to 4, via the function -/V mod (x) = min(max(floor(E 5 (y|x)), 1), 4), with E 5 (y|x) = 
(3, 1.5, 0, 0, 2, 0, 0, 0)(xi, . . . , x$) J ■ The four modes are respectively defined by Ei(F|x), E 2 (y |x), 
E 3 (y|x), and E 4 (y|x), with Ei(y|x) and E 4 (y|x) given by J3J and Q3J), along with 

E 2 (y|x) = (-2x 1 ) I{:El< - 6) (-1.2x 1 ) I(:E ^- 6) + cos(57rx 2 )/(l + 3x2), 
E 3 (y|x) = ((x 1 ,x 2 ,x 3 ,x 4 )(3) 2 exp((x 1 ,x 2 ,x 3 ,x i )(3)-f3 = (2,1,1, l) T /7^ 2 . 

We simulated a data set of n = 100 observations from a sampling density n(yi\E d . (y|xj), of) 
xuio(xj|0, 1), for i = l,...,n, with d{ ~ u{l} when iV mod (xj) = 1, d{ ~ u{l,2} when 
A r mod(x i ) = 2, di ~ u{l,2,3} when A^ mod (xi) = 3, and d { ~ u{l,2,3,4} when iV mod (xi) = 4. 
Also, we simulated another data set, of n = 225 observations, from the same density. 

— Insert Table Q] — 

To analyze each of the four simulated data sets described in this subsection, our GP-based 
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regression model, and our independence model, each assumed priors fij ~j.j.d.n(/i, 100), with 
Jl the empirical mean of the simulated Y. Otherwise, each of these models assumed the 
same priors for their other parameters, and the GP-based model assumed the same squared- 
exponential covariance function for z-standardized covariates, as in the previous subsection. 
Moreover, each of these two models were estimated according to 150,000 MCMC sampling 
iterations, after discarding the first 75,000 samples (burn-in). Also, the BART model was 
fit to each of the four data sets, via the generation of 300,000 posterior samples. For each 
of the three models, the D(m) criterion stabilized over MCMC iterations, and the resulting 
D{m) estimate yielded a small 95% MCCI. 

Table [1] presents the results of the simulation study, in the comparison of the three mod- 
els, in terms of the mean-squared predictive error D(m). The GP-based model obtained the 
best D(m) predictive performance for all the simulated data sets. Also, for the GP-based 
model, the individual Di(m) predictive errors tended to be quite small, even though the true 
data-generating models were quite complex. Specifically, for the 2-dimensional unimodal, 
10- dimensional unimodal, the multimodal (n = 100), and the multimodal (n = 225) simu- 
lated data sets, the model obtained Di{m) 5-number summaries (i.e., min, 25%ile, 50%ile, 
75%ile, and max) of {.01, .01, .02, .02, .15}, {.01, .02, .03, .04, .10}, {.00, .02, .02, .03, .07}, and 
{.01, .02, .02, .04, 4.8}, respectively. 

4 Conclusions 

We have described a Bayesian nonparametric regression model, and demonstrated the suit- 
ability of the model through the analysis of both real and simulated data sets. The key 
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idea of the paper is that close covariates x and x' should result in y and y' being close 
in probability, rather than in distribution, which has led to the current prevailing model 
constructions. Close in probability suggest outcomes from close covariates share a common 
component distribution which is, in our case, modeled as a normal distribution. For this to 
happen the weights at a particular component value for these similar covariates should both 
be close to 1, and to facilitate this a dependent Gaussian process is the most suitable model. 
Hence, all the aspects of the model play a clear discernible role. 

Appendix: MCMC Algorithm 

Our infinite-dimensional regression model can be estimated via the implementation of the 
MCMC sampling methods of Kalli et al. (2010). This method involves introducing strategic 
latent variables, to implement exact MCMC algorithms for the estimation of the model's 
posterior distribution. Specifically, for our regression model (Section [2]), we introduce new 
latent variables (itj)" =1 , and a decreasing function £ d = exp(— \d\), such that the model's 
data likelihood can be rewritten as the joint distribution: 

n n 

Hfiyud^u^z) = J]{1(0 <u, t < i^Qlf^OdM^i) e A*)}. (5) 
i=i i=i 

Marginalizing over the latent variables Ui in (jSJ), for each i = 1, . . . ,n, returns the original 
model (eq. [Taj. Thus, given the new latent variables, the infinite-dimensional model can be 
treated as a finite- dimensional model. This, in turn, permits the use of standard MCMC 
methods to sample the model's full joint posterior distribution. Given all variables, save 
the (di)™ =1 , the choice of each di have minimum — 7V max and maximum iV max , where iV max = 
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maxj [max, 1 (t^ < ^) \ j | ] . 

Then for our regression model, assuming the normal kernel densities f(yi\0j) = n(yi\fi cr|) 
j = 0, ±1, ±2, . . ., each stage of the MCMC algorithm proceeds by sampling from the fol- 
lowing full conditional posterior densities: 

1. 7T (/x, ••• = n U, ^ 2 ■ J 2 > 2 2 , for j = 0,±l,...,±iV max , with 

rij = Y, l ^yj = TT E !/»> ^ = max^maXj- < given n independent 

i:z<=j J i:zi=j 

uniform random draws Ui ~ u(0, i = 1, . . . , n 



, ii,, 



2. 7r(t7T 2 |-..) = ga trj 2 



a ff j + Irij, (3 aj + 1 Yl (Vi ~ AS') 2 I > for i = °> • • ■ > ±iV n 



3. Pr(d, = j| •••) oc < ^■)^" 1 n(?/ i |/i i ,a 2 )P(2(x i ) G A,-), for j = 0, . . . , ±iV max and 
for i = 1, . . . , n, where P(*(x0 G A,-) = n (z^)^*, dz, and r/* = (1, xJ)/3 + 
E(-^aMi)(^( x - (^D/ 3 )* S iven the Precision matrix, = (cr^fo, xj))~£ n = 

tixni 

4. 7r(^(xi)| •••) oc l(z(xt) G A di = (t^ - l,d i ])n(z(x i )\r]*,ipi i 1 ), for j = 1, . . . , n; 

5. 7r(/3| • ■ ■ ) = n(/3|m^, ^V^), given V£ = (V^ + P^X)" 1 and nfy = V;(V^ 1 m /3 + 
XT^^ n) z), where z n = (z(xi), . . . , z(x n ))T; 

6. 7i(a 2 c \ ■■■)= ga(a c - 2 |a c + n/2, 6 + {m^V^n^ - ^z ri -(myT(V^- 1 m^}/2); 

7. 7r(0| • ■ ■ ) oc n(z(xi), . . . , z(x n ) |X n /3, ^(C^Xj, xj)) nxn )7r(0); 

8. /(|/ pred |x, ■ ■ • ) oc n(y|/x J -, cr 2 )l(z(x) G Aj)n(z(x)|//(x), cr^(x)) for each covariate in- 
put x of interest, where cr^x) = cx 2 (C0(x, xi), . . . , C^(x, x n )) T , /U*(x) = (l,x T )/3 + 
tr^x)^^ - X ln j3), and CT * (x) = <7§C (x,x) - <t (x)t# J° <r A (x). 
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The full conditionals in Steps 1-6 and 8 can be sampled directly, using standard theory for 
Bayesian linear models, GP models, and standard methods for sampling truncated normal 
distributions (e.g., O'Hagan & Forster, 2004; Damien & Walker, 2001). The full conditional 
in Step 7 can be sampled using a Metropolis-Hastings or another rejection-sampling algo- 
rithm, if necessary. Step 8 of the MCMC algorithm provides samples from the posterior 
predictive density f n (y\x) of the regression model. The full 8-step sampling algorithm is re- 
peated a large number S of times, to construct a discrete-time Harris ergodic Markov chain 
{C^ — (A*) ct2 ; A Oq, 0)^}f=i having a posterior distribution n(£|2? re ) as its stationary dis- 
tribution, provided a proper prior n(£). We have written MATLAB (2012, The MathWorks, 
Natick, MA) code that implements the MCMC sampling algorithm. Standard methods can 
be used to check whether the MCMC algorithm has generated a sufficiently-large number of 
samples from the model's posterior distribution. Specifically, given MCMC samples {C (s) }f = i 
generated by the algorithm, univariate trace plots of these samples can be used to evalu- 
ate the mixing of the chain (i.e., the degree to which the chain explores the support of the 
posterior distribution). Also, for a posterior (moment or quantile) estimate of any chosen 
scalar functional (f(C), a Monte Carlo Confidence Interval (MCCI) can be computed via an 
applications of a batch means method (for posterior moment estimates) or a subsampling 
method (for posterior quantile estimate) applied to the MCMC samples {<£>(C^)}f=i- 

Simple modifications of the MCMC algorithm and/or our model (Section [2]) can be used 
to address other important data analysis tasks: 

• Multiple imputation of a censored dependent response At each iteration of the MCMC 
algorithm, a plausible value of a dependent response ?/, that is censored and known 
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only to fall within an interval (0^,6^], is sampled from the full conditional posterior 
predictive density n(y\x^, ••■) oc n(y\fi d ., a d .)l(y G (a yi , b Vi ]), and then is imputed as 
the updated value of ?/;. 

• Discrete-valued dependent variable: Our regression model can be extended to handle 
ordinal discrete- valued dependent variable responses, y, t G {0, 1, . . . , C™ ax > 1} (i — 
1, ...,n), by using instead probit kernels of the form f(c\6j) = J A ,,n(y*\jij, cr?)dy* 
(j = 0, ±1, ±2, . . .), for disjoint sets A(c) such that U^ =0 A(c) = K. In this case, we 
would add a step to the existing MCMC algorithm, to sample from the full conditional 
posterior density of the latent variables 7r(y*|xj, • ■ ■ ) oc n(y*\/i d .,a d .)l(y* G A(yi)), 
i = 1, . . . , n. Then all the other steps of the original MCMC algorithm proceeds with 
the current state of the latent variables y* (i — 1, . . . , n) instead of the yi (i — 1, . . . , n). 

• Spatio-temporal setting: In such a setting, we may specify the covariance function 
c>cQ,(x, x') = cr^C^^x, x')^> 2 ( s '^' s',t'), given covariates x, spatial locations s G W K , 
and time t G 1R, where C^, 2 (-, •) denotes a correlation function for non-separable space 
and time effects (Gneiting & Guttorp, 2010). For example, the covariance function: 

crcC^(x,x') = o-^i(2£,x') C 2 ( s ^; s ')0 

= a 2 c exp(-.5||x - x'|| 2 )exp (-.5{||s - s'|| 2 /2 + ||t - t'|| 2 /2}) ■ 



Estimating D T (m): For a given Bayesian model m, the estimate of the criterion D(m) 
is obtained by D(m) = jjYl^iiVi ~ 2/f rcd ^' ) } 2 > given posterior predictive samples 

{{(y! ied{s) Km)}U} S s=i- 
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Generating 
Model: 


D(m) 


GP Indep BART 


2- dimensional (n = 225) 
10-dimensional [n = 100) 
Multimodal (n = 100) 
Multimodal (n = 225) 


5.0 (±.1) 75.2 (±.5) 48.7 (±.6) 

o n fx k "7*7 o f j—iy A \ i A n o f-LO 1 ^ 
z.y l^±.oJ O/i. o (^zb / -4J 14U.Z ^o.lj 

2.3 (±.1) 18.0 (±.4) 2691.0 (±9.1) 

27.6 (±2.5) 316.9 (±7.4) 6415.4 (±11.4) 



Table 1: Results of the Simulation Study. Predictive accuracy of the GP-based regression 
model, versus the independence model. (Each number in parentheses gives the corresponding 
95 percent MCCI.) 

FIGURE CAPTIONS 

Figure 1. The mixture weights Wj(x) and corresponding predictive density /(y|x) of the 
model. The figure assumes rj(xi) = —.30, rj(x2) = .21, 77(0:3) = 4.8, and the covariance 
function a(x, x') = a 2 c exp(— .5| \x — x'\ | 2 ). 

Figure 2. For the GP model, the posterior predictive mean of the change score (solid 
line) plus/minus 2 times the posterior predictive variance (dashed lines). 

Figure 3. For the GP model, the posterior predictive density estimates, given a range of 
pretest scores, and conditional on uic=l and lmt 140=1. 
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